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Bayesian analysis of matrix data with rstiefel 

Peter Hoff * 
April 15, 2013 

Abstract 

We illustrate the use of the R-package rstiefel for matrix- variate data analysis in the context 
of two examples. The first example considers estimation of a reduced-rank mean matrix in the 
presence of normally distributed noise. The second example considers the modeling of a social 
network of friendships among teenagers. Bayesian estimation for these models requires the 
ability to simulate from the matrix- variate von Mises-Fisher distributions and the matrix- variate 
Bingham distributions on the Stiefel manifold. 



^^ 1 Exponential families on the Stiefel manifold 



The set of m x R matrices U for which U U = I/j is called the m x R Stiefel manifold and is 
denoted VR,m- The densities of a quadratic exponential family on this manifold (with respect to 
the uniform measure) are given by 



p(U\A, B, C) oc etr(C' U + BU' AU), 



(1) 



where C G M™^^, B is an i? x i? diagonal matrix and A is a symmetric matrix. Since IJ-^U = I, the 
density is unchanged under transformations of the form A — t- A -|- al or B — )• B -|- 61. Additionally, 
it is convenient to restrict the diagonal entries of B to be in decreasing order. If B is not ordered in 
this way, there exists a reparameterization (A, B, C) giving the same distribution as (A, B, C) but 
where B has ordered diagonal entries. More details on the Stiefel manifold and these distributions 



can be found in Chikuse (2003), Hoff (2009a), Hoff (2009b) and the references therein. 
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Distributions of this form were originally studied in the case R = 1, so that the manifold was 
just the surface of the ?7i-sphere. In this case, B reduces to a scalar and can be absorbed into the 
matrix A. The quadratic exponential family then has densities of the form 

p{u\c, A) oc exp(c^u + u'^Au). (2) 

The case that A = was studied by von Mises, Fisher and Langevin, and so a distribution with 
density proportional to exp(c^u) is often called a von Mises-Fisher or Langevin distribution on the 
sphere. The case that c = and A 7^ was studied by Bingham (1974[), and is called the Bingham 



distribution. This distribution has "antipodal symmetry" in that p(u|A) = p{—u\A), and so may 
be appropriate as a model for random axes, rather than random directions. 

In recognition of the work of the above mentioned authors, we refer to distributions with densi- 
ties given by (pj and M as vector-variate and matrix-variate Bingham-von Mises-Fisher distribu- 
tions, respectively. This is a rather long name, however, so in this vignette I will refer to them as 
BMF distributions. The case that A (or B) is the zero matrix will be referred to as an MF distri- 
bution, and the case that C is zero will be referred to as a Bingham distribution. More descriptive 
names might be L, Q and LQ to replace the names MF, Bingham, and BMF, respectively, the idea 
being that the "L" and "Q" refer to the presence of linear and quadratic components of the density. 

2 Model-based SVD 

It is often useful to model an m x n rectangular matrix-variate dataset Y as being equal to some 
reduced rank matrix M plus i.i.d. noise, so that Y = M + E, with the elements {eij : 1 < i < 
m, 1 < j < n} of E assumed to be i.i.d. with zero mean and some unknown variance a'^. The 
singular value decomposition states that any rank-i? matrix M can be expressed as M = UDV , 
where U G VR^m, V G Vr^u and D is an i? x i? diagonal matrix. If we are willing to assume 
normality of the errors, the model can then be written as 

Y = UDV^ + E 

E = {ejj : 1 < « < m, 1 < j < n} ~ i.i.d. normal(0, a ). 



Bayesian rank selection for this model was considered in Hoff (2007). In this vignette we con- 



sider estimation for a specified rank R, in which case the unknown parameters in the model are 



{U, D, V, o"^}. Given a suitable prior distribution over these parameters, Bayesian inference can 
proceed via construction of a Markov chain with stationary distribution equal to the conditional 
distribution of the parameters given Y, i.e. the distribution with density p(U, D, V, o"^|Y). In 
particular, conjugate prior distributions allow the construction of a Markov chain via the Gibbs 
sampler, which iteratively simulates each parameter from its full conditional distribution. If the 
prior distribution for U is uniform on Vu^m, then its full conditional density is given by 

p(U|Y,D,V,a2) cc p(Y\V,-D,V,a^) 

oc etr(-[Y-UDV^]^[Y-UDV^]/(2c72)) 
oc etr([YVD/CTYU), 

which is the density of an MF(YVD/cr^) distribution. Similarly, the full conditional distribution 
of V under a uniform prior is MF(Y"^UD/cj^). For this vignette, we will use the following prior 
distributions for {di, . . . , d^j, o"^}: 

{di, . . . ,dji\T } r^ i.i.d. normal(0, r ) 
1/r^ ~ gamma(77o/2,??oro/2) 
l/o" ~ gamma(i'o/2, fo<7o/2-) 

The corresponding full conditional distributions are 

{(ij|U,V,Y,d_j,cj^r^} ~ noTmal{T^ujYvj/[a^ + T^],T^a^/[T^ + a^]) 
{1/t2|U,D,V,Y,^2| ^ gamma([r?o+i?]/2,[%r2 + J]d2]/2) 

{l/a2|U,D,V,Y,r2} ~ gamma([z^o + "iH/2, K^o + l|Y-UDV^||2]/2). 

2.1 Simulated data 

We now randomly generate some parameters and data according to the model above: 

> library (rstief el) 

> set .seed(l) 

> m<-60 ; n<-40 ; R0<-4 

> UO<-rustiefel(m,RO) 

> VO<-rustiefel(n,RO) 



> DO<-diag(sort(rexp(RO) , decreasing=TRUE) ) *sqrt (m*n) 

> MO<-UO%*XDO%*Xt (VO) 

> Y<-MO + matrix(riiorm(n*m) ,m,n) 

The only command from the rstiefel package used here is rustief el, which generates a uniformly 
distributed random orthonormal matrix. Note that rustief el (m,R) gives a matrix with m rows 
and R columns, and so the arguments are in the reverse of their order in the symbolic representation 
of the manifold VR^m- 

2.2 Gibbs sampler 

Now we try to recover the true values of the parameters {Uq, Vo,Do,o"^} from the observed data 
Y. Just for fun, let's estimate these parameters with a presumed rank R > Rq that is larger than 
the actual rank. Equivalently, we can think of Uq, Vq, Dq as having dimension m x R, n x R and 
R X R, but with the last R — Rq diagonal entries of Dq being zero. 

The prior distributions for U and V are uniform on their respective manifolds. We set our 
hyperparameters for the other priors as follows: 

> nuO<-l ; s20<-l #inverse-gainma prior for the error variance s2 

> etaO<-l ; t20<-l #inverse-gamma prior for the variance t2 of the sing vals 

Construction of a Gibbs sampler requires starting values for all (but one) of the unknown parame- 
ters. An natural choice is the MLE: 

> R<-6 

> tmp<-svd(Y) ; U<-tmp$u [ , 1 : R] ; V<-tmp$v[,l :R] ; D<-diag(tmp$d[l:R] ) 

> s2<-var(c(Y-UX*7M*Xt(V))) 

> t2<-mean(diag(D'2)) 

Let's compare the MLE of D to the true value: 

> d.mle<-diag(D) 

> d.mle 

[1] 40.05172 25.00226 19.70827 13.43382 13.10381 12.64942 



> diag(DO) 

[1] 38.514216 24.015791 17.352783 1.169442 

The values of the MLE are, as expected, larger than the true values, especially for the smaller 
values of Dq. Now let's see if the Bayes estimate provides some shrinkage. 

> MPS<-matrix(0,m,n) ; DPS<-NULL 

> for(s in 1:2500) 
+ { 

+ U<-rmf. matrix (YX*'/X/c*XD/s2) 

+ V<-rmf. matrix (t(Y)%*XU%*%D/s2) 

+ 

+ vd<-l/(l/s2+l/t2) 

+ ed<-vd*(diag(t(U)X*XYX*XV)/s2 ) 

+ D<-diag(rnorm(R,ed,sqrt(vd))) 

+ 

+ s2<-l/rgamma(l, (nuO+m*n)/2 , (nu0*s20 + sum((Y-UX*XDX*Xt(V))~2))/2 ) 

+ t2<- 1/r gamma (1, (eta0+R)/2, (etaO*t20 + sum(D~2))/2) 

+ 

+ ### save output 

+ if(sXX5==0) 

+ { 

+ DPS<-rhind (DPS, sort (diag(ahs (D) ) , decreasing=TRUE) ) 

+ M<-UX*XDX*Xt(V) 

+ MPS<-MPS+M 

+ } 

+ } 

This generates a Gibbs sampler of 2500 iterations. Here, we save the values of D every 5th iteration, 
resulting in a sample of D- values of size 500 with which to estimate p(D|Y). Additionally, we can 
obtain a posterior mean estimate of Mq = UoDqVq via the sample average of UDV . Note that 
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Figure 1: Some output of the Gibbs sampler. 

this estimate is not of rank R, as the set matrices of less than full rank is not convex. If we want a 
rank R estimate, we could take the rank-i? approximation of the posterior mean. 

Let's look at the squared error for the MLE, the posterior expectation of Mq, and the rank-i? 
approximation to the posterior expectation: 

> tmp<-svd(Y) ; M.ml<-tnip$u[,l:R]X*Xdiag(tmp$d[l:R] )X*Xt(tmp$v[,l:R] ) 

> M.bl<-MPS/dim(DPS) [1] 

> tmp<-svd(M.hl) ; M.h2<-tmp$u[,l :R]X*Xdiag(tmp$d[l:R])X*Xt(tmp$v[,l :R]) 

> mean( (M0-M.ml)~2 ) 

[1] 0.3563462 

> mean( (M0-M.bl)'2 ) 
[1] 0.1315899 

> 222ean( (M0-M.b2)'2 ) 

[1] 0.1311898 

Not surprisingly, the MLE has a much larger loss than the Bayes estimates. The squared error for 
the two Bayes estimates are nearly identical. This is because although the posterior mean has full 
rank tti A n, it is very close to its rank-i? approximation. 
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Finally, let's make some plots based on the output of the Gibbs sampler. The left-most plot of 
Figure [l] gives simulated values of D, with the values of Dq given in thick lines. The mixing of the 
Markov chain looks pretty reasonable. The center plot gives Mq versus its posterior expectation, 
approximated from the MCMC sample average of UDV^. The right plot gives the MLEs of Dq 
in pink, the posterior expectations of Dq in light blue, and the true values in thin black lines. The 
posterior estimates are very accurate for the large singular values of Dq, but are overestimates for 
the smallest values (the last R — Rq of which are zero). However, these Bayes estimates are much 
better than the unregularized MLEs. 

3 Network analysis 

The package rstiefel includes a dataset on the social network and some health behaviors of a group 



of n = 50 Scottish teenage girls. These data were derived from the data available at http : //www 



stats.ox.ac.uk/~snijders/siena/s50_data.htm and described in Michell and Amos (1997). 



3.1 An eigenmodel for symmetric networks 

Let Y be the n x n symmetric adjacency matrix corresponding to this network, with off-diagonal 
entry yij equal to the binary indicator of a friendship between actors i and j, as reported by one 
or both actors. Li this vignette we will derive a model-based representation of these data using the 
following reduced-rank probit model: 

Zi,j = 6 + uf Auj + eij (3) 

ViJ = l(0,oo)(^J,j)) 

where {eij = Cj^i} ~ i.i.d. normal(0,l), A = diag(Ai,A2) and the matrix U with row vectors 
ui, . . . , u„ lies in the Stiefel manifold Vr^u- This model is a type of two-way latent factor model in 
which the relationship between actors i and j is modeled in terms of their unobserved latent factors 
Uj and Uj. This model and its relationship to other latent variable network models are described 



more fully in Hoff (2008). 



Convenient prior distributions for {U, A, 9} are as follows: 

9 ~ normal(0, Tg ) 

(Ai,A2) ~ i.i.d. normal(0,r;^) 

U ~ uniform(VR,n) 

Conditional on the observed network Y, posterior inference can proceed via a Gibbs sampling 
scheme for the unknown quantities {Z,U, A,0}. Under model (Is]), observing yij = or 1 implies 
that Zij is less than or greater than zero, respectively. Thus conditional on {Y, U, A, 9}, the distri- 
bution of Z is that of a random symmetric normal matrix with mean 9 + UAU and independent 
entries that are constrained to be positive or negative depending on the entries of Y. Given Z, 
the full conditional distributions of {U, A, 9} do not depend on Y, and can be obtained from the 
corresponding prior distributions and the density for the matrix Z, given by 

p(Z|U,A) oc etr(-[Z-eil^-UAU'^]'^[Z-011^-UAU^]/4) 

= etr(-E'^E/4) x etr(AU'^EU/2) x etr(-AV4), (4) 

where E = Z — 911^ has mean UAU"^ and off-diagonal variances of 1. The diagonal elements 
of E (and Z) have variance 2, but do not correspond to any observed data as the diagonal of 
Y is undefined. These diagonal elements are integrated over in the Markov chain Monte Carlo 
estimation scheme described below. From Ml), the full conditional distribution of U is easily seen 
to be a Bingham(E/2, A) distribution. Full conditional distributions for the other quantities are 



available via standard calculations, and are given in Hoff (2009a) and in the code below. 



3.2 Gibbs sampler 

The data for this example are stored as a list: 

> data(YX_scots) ; Y<-YX_scots$Y ; X<-YX_scots$X 

The n X 2 matrix X provides a binary indicator of drug use and smoking behavior for each actor 
during the period of the study. Understanding the relationship between these health behaviors 
and the social network can be facilitated by examining the relationship between X and the latent 
factors U that represent the network via the model given in ([s]). 

We specify the dimension of the latent factors and the values of the hyperparameters as follows: 



> ## priors 

> R<-2 ; t2.1ambda<-dim(Y) [1] ; t2.theta<-100 

A value of r| = n allows the prior magnitude of the latent factor effects to increase with n, but not as 
fast as the residual variance: Letting Ui be the first column of U, we have E[||AiUiUj^|p] = E[Af] = 
n. On the other hand, letting £ be the matrix of residuals {eij} , we have E[||iS|p] = (n + l)n. 
For brevity, we consider simple, naive starting values for the unknown parameters: 

> ## starting values 

> theta<-qnorm (mean (c (Y) , na . rm=TRUE) ) 

> L<-diag(0,R) 

> set .seed(l) 

> U<-rustiefel (dim(Y) [1] ,R) 

Better starting values could be obtained from a few iterations of an EM or block coordinate descent 
algorithm, although these naive starting values are adequate for this example. 

We are now ready to run the Gibbs sampler. We will store simulated values of A and 6 in the 
objects LPS and TPS, respectively. Instead of saving values of U, we will just compute the sum 
of UAU^ across iterations of the Markov chain. Dividing by the number of iterations, this sum 
provides an approximation to the posterior mean of UAU . A rank-i? eigendecomposition of the 
posterior mean can be used to provide an estimate of U. 

> ## MCMC 

> LPS<-TPS<-NULL ; MPS<-matrix(0,dim(Y) ,dim(Y)) 

> for(s in 1:10000) 

+ { 

+ 

+ Z<-rZ_fc (Y, theta+UX*XLX*Xt (U) ) 

+ 

+ E<-Z-U%*%LX*Xt(U) 

+ v.theta<-l/(l/t2.theta + choose (dim (Y) [1] ,2)) 

+ e. theta<-v . theta*sum (E [upper . tri (E)] ) 

+ theta<-rnorm (l,e.theta, sqrt (v . theta) ) 
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+ 

+ E<-Z-theta 

+ V . Iamhda<-2*t2 . lambda/ (2+t2 . lambda) 

+ e . lambda<-v . lambda*diag(t (U)%*%Ey,*%U/2) 

+ L<-diag(rnorm (R,e. lambda, sqrt (v . lambda) ) ) 

+ 

+ U<-rbing.matrix.gibbs(E/2,L,U) 

+ 

+ ## output 

+ if(s>100 & s°niO==0) 

+ { 

+ LPS<-rbind (LPS, sort (diag(L))) ; TPS<-c(TPS,theta) ; MPS<-nPS+U%*%L7,*%t (U) 

+ } 

+ } 

Note that this code uses a function rZ_f c, which simulates from the full conditional distribution 
of Z given {Y,U, A, 0}, which is that of independent constrained normal random variables. The 
code for this function can be obtained from the IM^gX source file for this document. 

A summary of the posterior distribution is provided in Figure [2] The first panel plots the 
posterior density of 9, and the second plots the (marginal) posterior densities of the ordered values 
of (Ai,A2). This plot strongly suggests that the values of Ai and A2 are both positive. Since 
the probability of a friendship between i and j is increasing in uJAuj, the results posit that 
friendships are more likely between individuals with similar values for their latent factors (this 
effect is sometimes referred to as homophily). The third panel plots the observed network with the 
node positions obtained from the estimates of ui, . . . , u„ based on the rank-2 approximation of the 
posterior mean of UAU"^. The plotting colors and characters for the nodes are determined by the 
drug and smoking behaviors: Non-smokers are plotted in green and smokers in red, non-drug users 
are plotted as circles and drug users as triangles. The plot indicates a separation between students 
with no drug or tobacco use (green circles) from the other students in terms of their latent factors, 
suggesting a relationship between these health behaviors and the social network. 
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Figure 2: Some output of the Gibbs sampler. 
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